2022-12-15

The Basic Problem

  • Missing data is the rule, not the exception.
  • Usually there is no description, analysis, or even acknowledgment of missing data.
  • That is bad and you shouldn’t do that 👨‍🏫.
  • Conclusions may (and do) change when missing data is accounted for.

Consequences of missing data:

  • lower power
  • bigger standard errors and confidence intervals
  • biased results

Steps in dealing with missing data

  1. Identify the missing data.
  2. Examine (the causes of) the missing data.
  3. Deal with the missing data.

Unfortunately, (1) is the only unambiguous step.

For those really interested: Check out the standard work by Little & Rubin (2002).

A classification of missing data

Missing completely at random (MCAR)

  • the presence of missing data on a variable is unrelated to any other observed or unobserved variable
  • easy to handle, but rare

Missing at random (MAR)

  • the presence of missing data on a variable is related to other observed variables but not to its own unobserved value

Not missing at random (NMAR)

  • the presence of missing data on a variable is related to its own unobserved value, so it is neither random nor predictable from the other information we have
  • analysis of NMAR data is complex (and beyond the scope of this course)
  • often the best way of dealing with this is to try to collect more information about why the data is missing
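A small simulation can make the three mechanisms concrete. The sketch below uses made-up data: the variables x and y, the 20% rate, and the logistic missingness models are all assumptions for illustration and have nothing to do with the course data. It compares the mean of the observed values of y under each mechanism.

```r
# Hypothetical simulated data, not the course data set
set.seed(1)
n <- 10000
x <- rnorm(n)             # fully observed covariate
y <- 0.5 * x + rnorm(n)   # variable that will receive missing values

# MCAR: every value has the same 20% chance of being missing
miss_mcar <- runif(n) < 0.2
# MAR: the chance of being missing depends only on the observed x
miss_mar <- runif(n) < plogis(-1.5 + 1.5 * x)
# NMAR: the chance of being missing depends on the unobserved y itself
miss_nmar <- runif(n) < plogis(-1.5 + 1.5 * y)

# Mean of the observed y under each mechanism (the true mean is 0)
obs_means <- c(
  full = mean(y),
  MCAR = mean(y[!miss_mcar]),
  MAR  = mean(y[!miss_mar]),
  NMAR = mean(y[!miss_nmar])
)
round(obs_means, 2)
```

Under MCAR the observed mean should stay close to the full-data mean. Under MAR the naive mean is off, but the bias can be addressed by using x. Under NMAR the observed data alone do not reveal the bias.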

Let’s continue with an example

Data (same as in session 3)

Source: Górnik-Durose, Malgorzata E. Materialism & personality. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-15. https://doi.org/10.3886/E101900V1

Publication: Górnik-Durose, M. E. (2020). Materialism and well-being revisited: The impact of personality. Journal of Happiness Studies: An Interdisciplinary Forum on Subjective Well-Being, 21(1), 305–326. https://doi.org/10.1007/s10902-019-00089-8

The data already has some missing values, but we will add an extra 10% per variable.

set.seed(42)
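mr <- 0.1  # share of values to set to missing in each variable
# Indexing with TRUE keeps a value, indexing with NA returns NA
df <- purrr::map_df(df_mat, function(x) {
  keep <- sample(c(TRUE, NA), prob = c(1 - mr, mr), size = length(x), replace = TRUE)
  x[keep]
})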
df %>% head()
## # A tibble: 6 × 5
##   gender marital_status    materialism neuroticism life_satisfaction
##   <fct>  <fct>                   <dbl>       <dbl>             <dbl>
## 1 <NA>   <NA>                       27          35                19
## 2 <NA>   married                    23          NA                23
## 3 men    married                    30          30                NA
## 4 women  <NA>                       27          34                15
## 5 women  civil partnership          32          30                25
## 6 men    civil partnership          23          18                23

1. Identify the missing data.

Tabular

dv = 'life_satisfaction'
iv = c('gender', 'marital_status', 'materialism', 'neuroticism')
df %>% finalfit::ff_glimpse(dv, iv)
Continuous variables:

                   label              var_type    n  missing_n  missing_percent  mean   sd  min  quartile_25  median  quartile_75   max
life_satisfaction  life_satisfaction  <dbl>     320         40             11.1  21.9  5.6  8.0         18.0    22.0         26.0  35.0
materialism        materialism        <dbl>     317         43             11.9  27.2  6.6  9.0         22.0    28.0         32.0  43.0
neuroticism        neuroticism        <dbl>     328         32              8.9  23.6  9.5  3.0         17.0    23.0         30.0  47.0

Categorical variables:

                label           var_type    n  missing_n  missing_percent  levels_n  levels                                                          levels_count         levels_percent
gender          gender          <fct>     317         43             11.9         2  “women”, “men”, “(Missing)”                                     215, 102, 43         60, 28, 12
marital_status  marital_status  <fct>     321         39             10.8         4  “single”, “civil partnership”, “married”, “other”, “(Missing)”  78, 93, 134, 16, 39  21.7, 25.8, 37.2, 4.4, 10.8
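The missing counts and percentages can be cross-checked without any add-on package. A minimal base R sketch, assuming the built-in airquality data as a stand-in for df (the object name dat is chosen here for illustration):

```r
# airquality ships with base R and stands in for the course data here
dat <- airquality

missing_n <- colSums(is.na(dat))                          # missing values per column
missing_percent <- round(100 * colMeans(is.na(dat)), 1)   # same, as a percentage

data.frame(n = nrow(dat) - missing_n, missing_n, missing_percent)
```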

Graphical

df %>% finalfit::missing_plot() %>% plotly::ggplotly()

2. Examine (the causes of) the missing data.

Things you want to answer

  • What percentage of the data is missing?
  • Are the missing data concentrated in a few variables or widely distributed?
  • Do the missing values appear to be random?
  • Does the covariation of missing data with each other or with observed data suggest a possible mechanism that’s producing the missing values?

Visual Patterns

df %>% VIM::aggr(prop=FALSE, numbers=TRUE, cex.axis=.7 )

Visual Patterns

  • Are any variables or variable combinations missing more often?
  • Is there something special about them?
    • sensitive data
    • procedure (e.g. at the end of long questionnaire)
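The pattern counts behind such a plot can also be tabulated by hand. The following is only a sketch in base R, again assuming the built-in airquality data as a stand-in for df. It labels each row with a string of 1s (observed) and 0s (missing) and counts how often each pattern occurs.

```r
# airquality stands in for the course data (an assumption for illustration)
dat <- airquality

# One string per row, in column order: "1" = observed, "0" = missing
pattern <- apply(!is.na(dat), 1, function(r) paste(as.integer(r), collapse = ""))

# Most frequent patterns first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

A pattern that occurs far more often than the per-variable missing rates would suggest is a hint that the variables involved go missing together.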

Associations between missing and observed data (graphical)

df %>% finalfit::missing_pairs(dv, iv)

Associations between missing and observed data (numerical)

df %>% finalfit::missing_compare(dv, iv)
Missing data analysis: life_satisfaction                Not missing  Missing     p
gender                                civil_placeholder

Take these p-values with a grain of salt: running many tests inflates \(\alpha\), and the tests used are not necessarily adequate for the data.
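The idea behind such a comparison is to turn missingness into an indicator variable and test whether it is associated with the observed variables. A rough base R sketch, assuming the built-in airquality data as a stand-in; the choice of a t-test and a chi-squared test is an assumption made here and need not match what finalfit uses internally.

```r
# airquality stands in for the course data (an assumption for illustration)
dat <- airquality
ozone_missing <- is.na(dat$Ozone)   # missingness indicator for one variable

# Continuous observed variable: compare its mean between missing / not missing
t_temp <- t.test(dat$Temp ~ ozone_missing)

# Categorical observed variable: cross-tabulate it against the indicator
tab_month <- table(month = dat$Month, ozone_missing)
chi_month <- chisq.test(tab_month)

tab_month
c(p_temp = t_temp$p.value, p_month = chi_month$p.value)
```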

Omnibus test for testing MCAR

Little’s (1988) test statistic assesses if data is missing completely at random (MCAR). The null hypothesis in this test is that the data is MCAR, and the test statistic is a chi-squared value.

naniar::mcar_test(df)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1      77.0    55  0.0269               19

To decide whether your data is MCAR, rely on your understanding of the mechanism causing the missing values rather than on this test.

3. Deal with the missing data.

Approaches

  • rational
  • complete-case analysis (listwise deletion)
  • multiple imputation
  • other approaches
    • FIML
    • pairwise deletion
    • single imputation
    • simple (nonstochastic) imputation

Rational approach

  • use mathematical or logical relationships among variables to fill in or recover missing values
  • e.g.: gender from first name, age from birth year, income from job, …
  • typically requires creativity and thoughtfulness
  • data recovery may be exact or approximate

Complete-case analysis (listwise deletion)

  • default in most software packages (and therefore the most widely used approach)
  • if the data is MCAR: unbiased results, but a smaller sample size with all its consequences (reduced statistical power)
  • if the data is not MCAR: biased results
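Because listwise deletion is the default, it is easy to lose rows without noticing. A minimal sketch, assuming the built-in airquality data as a stand-in for df, that checks how many rows a linear model actually used:

```r
# airquality stands in for the course data (an assumption for illustration)
dat <- airquality

# lm() silently drops every row with a missing value in any model variable
fit_cc <- lm(Ozone ~ Solar.R + Wind + Temp, data = dat)

c(rows    = nrow(dat),
  used    = nobs(fit_cc),
  dropped = length(fit_cc$na.action))
```

Comparing nobs() with nrow() before interpreting a model is a cheap way to see how much of the sample the analysis really rests on.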

Multiple imputation

  • 3 steps:
    • impute: the distributions of the missing values are estimated (via regression techniques), and multiple complete data sets (10-20 are typical) are created by drawing from these distributions
    • analyze: analysis is done with each data set
    • pool: results from all data sets are pooled
  • frequently the method of choice for complex missing-values problems

Multiple imputation - Tips

  • keep auxiliary variables (neither DV nor IV) in the data set (they might help predict the missing values)
  • don’t be fancy: impute your DV and also keep the imputed values in the analysis (see Kontopantelis et al., 2017)
  • double-check the scale of your variables and use an adequate regression technique (are categorical variables defined as factors?)

Multiple imputation - Impute

imp <- mice::mice(df, m = 5, seed = 42) # m = 5 keeps the example fast; use m = 20 in practice
summary(imp)
## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##            gender    marital_status       materialism       neuroticism 
##          "logreg"         "polyreg"             "pmm"             "pmm" 
## life_satisfaction 
##             "pmm" 
## PredictorMatrix:
##                   gender marital_status materialism neuroticism
## gender                 0              1           1           1
## marital_status         1              0           1           1
## materialism            1              1           0           1
## neuroticism            1              1           1           0
## life_satisfaction      1              1           1           1
##                   life_satisfaction
## gender                            1
## marital_status                    1
## materialism                       1
## neuroticism                       1
## life_satisfaction                 0

Multiple imputation - Analyze

fit_mi <- with(imp, lm(life_satisfaction ~ neuroticism + materialism + marital_status))
Coefficients from each of the 5 imputed data sets (one row per data set):

(Intercept)  neuroticism  materialism  marital_statuscivil partnership  marital_statusmarried  marital_statusother
   30.77079   -0.2891596   -0.1074960                         2.129492              0.4735927            1.1733647
   30.29802   -0.3022146   -0.0803636                         1.985887              0.6170638            0.8552175
   30.55198   -0.2992326   -0.0852779                         2.090393              0.1407699            1.9350182
   29.78756   -0.2949250   -0.0728522                         2.229981              0.6713768            2.3291168
   29.76672   -0.2960791   -0.0465611                         1.571953              0.0607582           -0.1899547

Multiple imputation - Pool

fit_mi %>% mice::pool() %>% summary()
##                              term    estimate  std.error   statistic        df
## 1                     (Intercept) 30.23501486 1.33748086  22.6059422 126.37935
## 2                     neuroticism -0.29632219 0.02879890 -10.2893594 307.86669
## 3                     materialism -0.07851016 0.04829938  -1.6254901  51.59514
## 4 marital_statuscivil partnership  2.00154107 0.74062156   2.7025153 118.79279
## 5           marital_statusmarried  0.39271229 0.71530036   0.5490173  85.97676
## 6             marital_statusother  1.22055252 1.61487257   0.7558197  18.33940
##        p.value
## 1 3.100257e-46
## 2 1.549380e-21
## 3 1.101518e-01
## 4 7.890119e-03
## 5 5.844166e-01
## 6 4.593560e-01
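Pooling follows Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance combines the average within-imputation variance with the between-imputation variance. The sketch below is a hand-rolled illustration, not the mice implementation (which also computes adjusted degrees of freedom). The function name pool_rubin is hypothetical. The estimates are the five materialism coefficients printed in the Analyze step; the standard errors are placeholders, because the per-imputation standard errors are not shown in these slides.

```r
# Hypothetical helper: pools one coefficient across m imputed data sets
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)             # pooled point estimate
  w <- mean(se^2)                # within-imputation variance
  b <- var(est)                  # between-imputation variance
  t_var <- w + (1 + 1 / m) * b   # total variance
  list(estimate = q_bar, std.error = sqrt(t_var), within = w, between = b)
}

# Materialism coefficients from the five imputed data sets (from the slides)
est_mat <- c(-0.1074960, -0.0803636, -0.0852779, -0.0728522, -0.0465611)
# Placeholder standard errors, assumed for illustration only
se_mat <- rep(0.042, 5)

pooled <- pool_rubin(est_mat, se_mat)
pooled$estimate
```

The pooled estimate reproduces the value reported by mice::pool() for materialism. The pooled standard error here depends on the placeholder inputs and is not comparable to the output above.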

Combine bootstrapping with MI

Create bootstrapped and multiply imputed data sets (number of data sets = nBoot * nImp)

boot_imp <- bootImpute::bootMice(df, nBoot=100, nImp=5, seed=42) # in practice use nBoot >= 1000 and nImp = 20

analyze and pool

# wrapper to analyze a data set
get_coefficients <- function(df) {
  coef(lm(life_satisfaction ~ neuroticism + materialism + marital_status, df))}
fit_boot_mi <- bootImpute::bootImputeAnalyse(boot_imp, get_coefficients)
##                                    Estimate Std. error 95% CI lower
## (Intercept)                     30.34609707 1.21458193   27.9318816
## neuroticism                     -0.29922530 0.03408520   -0.3669206
## materialism                     -0.07741992 0.05035722   -0.1774608
## marital_statuscivil partnership  1.93145144 0.69844148    0.5435935
## marital_statusmarried            0.20961544 0.67871347   -1.1396123
## marital_statusother              0.87832362 1.04708233   -1.2026007
##                                 95% CI upper            p
## (Intercept)                      32.76031252 2.162110e-41
## neuroticism                      -0.23153003 8.406916e-14
## materialism                       0.02262095 1.276937e-01
## marital_statuscivil partnership   3.31930939 6.916982e-03
## marital_statusmarried             1.55884320 7.581880e-01
## marital_statusother               2.95924789 4.038431e-01

For more information see e.g. Bartlett and Hughes (2020).

Other approaches

FIML (full information maximum likelihood)

  • equivalent results to MI (see e.g. Li & Shi, 2021)
  • only within structural equation modeling

Pairwise deletion

  • different analyses use different samples (what’s the sample size?)
  • the resulting correlation matrix may not be positive definite
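The sample-size problem is easy to see by counting the complete pairs behind each correlation. A minimal base R sketch, assuming the built-in airquality data as a stand-in for df:

```r
# airquality stands in for the course data (an assumption for illustration)
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Each correlation uses all rows where both variables are observed
r_pairwise <- cor(dat, use = "pairwise.complete.obs")

# Number of rows behind each correlation: 1 = observed, 0 = missing
obs <- (!is.na(dat)) * 1
n_pairwise <- t(obs) %*% obs

round(r_pairwise, 2)
n_pairwise
```

Every cell of n_pairwise can be a different number, so there is no single sample size to report for the correlation matrix as a whole.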

Single imputation

  • equivalent to MI with m = 1 (no pooling)
  • underestimates standard errors

Simple (nonstochastic) imputation

  • one value replaces all missing values of a variable
  • e.g. the mean or median for continuous variables
  • a separate “missing” category for categorical variables
  • biased results if not MCAR
  • underestimates standard errors
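Why mean imputation underestimates standard errors can be shown in a few lines: the filled-in values add no spread but do increase n. A minimal base R sketch, assuming the built-in airquality data as a stand-in for df:

```r
# airquality stands in for the course data (an assumption for illustration)
dat <- airquality
ozone_obs <- dat$Ozone[!is.na(dat$Ozone)]

# Replace every missing value with the observed mean
ozone_imp <- ifelse(is.na(dat$Ozone), mean(ozone_obs), dat$Ozone)

# Standard error of the mean: sd / sqrt(n)
se_mean <- function(v) sd(v) / sqrt(length(v))

c(sd_observed = sd(ozone_obs), sd_imputed = sd(ozone_imp),
  se_observed = se_mean(ozone_obs), se_imputed = se_mean(ozone_imp))
```

The mean itself is unchanged, but the standard deviation shrinks and the apparent sample size grows, so the standard error looks smaller than the data justify.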